Structured Co-reference Graph Attention for Video-grounded Dialogue

Abstract

A video-grounded dialogue system referred to as the Structured Co-reference Graph Attention (SCGA) is presented for decoding the answer sequence to a question regarding a given video while keeping track of the dialogue context. Although recent efforts have made great strides in improving the quality of the response, performance is still far from satisfactory. The two main challenging issues are as follows: (1) how to deduce co-reference among multiple modalities and (2) how to reason on the rich underlying semantic structure of video with complex spatial and temporal dynamics. To this end, SCGA is based on (1) a Structured Co-reference Resolver that performs dereferencing via building a structured graph over multiple modalities, and (2) a Spatio-temporal Video Reasoner that captures local-to-global dynamics of video via gradually neighboring graph attention. SCGA makes use of a pointer network to dynamically replicate parts of the question for decoding the answer sequence. The validity of the proposed SCGA is demonstrated on the AVSD@DSTC7 and AVSD@DSTC8 datasets, challenging video-grounded dialogue benchmarks, and the TVQA dataset, a large-scale videoQA benchmark. Our empirical results show that SCGA outperforms other state-of-the-art dialogue systems on both benchmarks, while an extensive ablation study and qualitative analysis reveal the performance gain and improved interpretability.
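
The abstract's remark about using a pointer network to replicate parts of the question can be illustrated with a generic copy mechanism. The sketch below is not the SCGA implementation: it assumes a pointer-generator style mixture in which a gate `p_gen` blends a vocabulary distribution with attention weights over the question tokens. The function name `copy_augmented_distribution`, the gate `p_gen`, and the toy inputs are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def copy_augmented_distribution(vocab_probs, question_tokens, attention_scores, p_gen):
    """Blend generating from a vocabulary with copying question tokens.

    Assumption (not taken from the paper): the final probability of a token is
    p_gen * P_vocab(token) + (1 - p_gen) * (attention mass on that token).
    """
    if len(question_tokens) != len(attention_scores):
        raise ValueError("one attention score is required per question token")
    if not 0.0 <= p_gen <= 1.0:
        raise ValueError("p_gen must lie in [0, 1]")

    attention = softmax(attention_scores)

    # Start from the scaled vocabulary distribution.
    mixed = {token: p_gen * prob for token, prob in vocab_probs.items()}

    # Add copy probability; repeated question tokens accumulate their mass.
    for token, weight in zip(question_tokens, attention):
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed


if __name__ == "__main__":
    # Toy example: "sitting" is outside the vocabulary but can still be copied.
    dist = copy_augmented_distribution(
        vocab_probs={"yes": 0.5, "no": 0.5},
        question_tokens=["is", "he", "sitting"],
        attention_scores=[0.0, 0.0, 2.0],
        p_gen=0.5,
    )
    for token, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
        print(f"{token}: {prob:.3f}")
```

Under this assumed formulation, a word that appears only in the question can receive non-zero probability in the answer, which is the general motivation for copy mechanisms in answer decoding.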

Similar resources

Spatio-Temporal Attention Models for Grounded Video Captioning

Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map from raw video da...

A Structured Distributional Semantic Model for Event Co-reference

In this paper we present a novel approach to modelling distributional semantics that represents meaning as distributions over relations in syntactic neighborhoods. We argue that our model approximates meaning in compositional configurations more effectively than standard distributional vectors or bag-of-words models. We test our hypothesis on the problem of judging event coreferentiality, which...

Grounded Semantics as Persuasion Dialogue

In the current work, we provide a formal Mackenzie-style persuasion dialogue for grounded semantics. We show that an argument is in the grounded extension iff the proponent is able to persuade a maximally sceptical opponent in the dialogue.

Grounded Objects and Interactions for Video Captioning

We address the problem of video captioning by grounding language generation on object interactions in the video. Existing work mostly focuses on overall scene understanding with often limited or no emphasis on object interactions to address the problem of video understanding. In this paper, we propose SINet-Caption that learns to generate captions grounded over higher-order interactions between...

Facial Expression Grounded Conversational Dialogue Generation

We present a novel conversational language model that is grounded with information about facial expressions. To our knowledge this is the first in-depth examination of grounding natural language models with facial cues. We train a neural language model that uses automatically detected facial action unit intensity information in images alongside text to generate conversational dialogue. We evalu...

Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2021

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v35i2.16273